:orphan:
Core Basics 1: Train, Evaluate and Deploy a Classifier
======================================================
In this lesson we will learn how to train, evaluate and deploy
classifiers with Khiops.

Make sure you have installed Khiops and Khiops Visualization.

We start by importing Khiops and defining some helper functions:

.. code:: ipython3

    import os
    import platform
    import subprocess
    from itertools import islice

    from khiops import core as kh


    # Define peek helper function
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            # islice stops after n lines instead of loading the whole file
            for line in islice(file, n):
                print(line, end="")
        print("")


    # If there are any issues you may check the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Classifier
---------------------
We’ll train a classifier for the ``Iris`` dataset. This is a classical
dataset containing the data of different plants belonging to the genus
*Iris*. It contains 150 records, 50 for each of three variants of
*Iris*: *Setosa*, *Virginica* and *Versicolor*. The records for each
sample contain the length and width of its petal and sepal. The standard
task for this dataset is to construct a classifier for the type of
*Iris* taking as inputs the length and width characteristics.
Now to train a classifier with Khiops we use two types of files:

- A plain-text delimited data file (for example a ``csv`` file)
- A *dictionary* file which describes the schema of the above data table
  (``.kdic`` file extension)

Let’s save into variables the locations of these files for the ``Iris``
dataset and then take a look at their contents:
.. code:: ipython3
iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
print(f"Iris dictionary file: {iris_kdic}")
peek(iris_kdic)
print(f"Iris data file: {iris_data_file}\n")
peek(iris_data_file)
.. parsed-literal::
Iris dictionary file: /github/home/khiops_data/samples/Iris/Iris.kdic
Dictionary Iris
{
Numerical SepalLength ;
Numerical SepalWidth ;
Numerical PetalLength ;
Numerical PetalWidth ;
Categorical Class ;
};
Iris data file: /github/home/khiops_data/samples/Iris/Iris.txt
SepalLength SepalWidth PetalLength PetalWidth Class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
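
To make the link between the two files concrete, here is a minimal sketch that checks that the variables declared in the dictionary match the header of the data file. It uses only the Python standard library. The helper ``dictionary_variables`` is hypothetical (it is not part of Khiops and only handles the simple declarations shown above), and the tab separator is an assumption, since the listing only shows whitespace.

```python
# Sample contents copied from the listing above (first two data rows only).
kdic_text = """Dictionary Iris
{
    Numerical SepalLength ;
    Numerical SepalWidth ;
    Numerical PetalLength ;
    Numerical PetalWidth ;
    Categorical Class ;
};
"""

# Assumption: the data file is tab-separated.
data_text = (
    "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
    "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
    "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
)


def dictionary_variables(text):
    """Hypothetical helper: returns the (type, name) pairs of a simple dictionary."""
    variables = []
    for line in text.splitlines():
        parts = line.replace(";", " ").split()
        if len(parts) == 2 and parts[0] in ("Numerical", "Categorical"):
            variables.append((parts[0], parts[1]))
    return variables


variables = dictionary_variables(kdic_text)
header = data_text.splitlines()[0].split("\t")

# Each column of the data file should be declared once in the dictionary
print([name for _, name in variables] == header)
```

A real dictionary file can contain more constructs than this sketch understands, so it should be read as an illustration of the schema idea rather than a parser.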
Note that the *Iris* variant information is in the column ``Class``. Now
let’s specify a directory to save our results:
.. code:: ipython3
iris_results_dir = os.path.join("exercises", "Iris")
print(f"Iris results directory: {iris_results_dir}")
.. parsed-literal::
Iris results directory: exercises/Iris
We are now ready to train the classifier with the Khiops function
``train_predictor``. This function returns a tuple containing the
locations of two files:

- the modeling report (``AllReports.khj``): A JSON file containing
  information such as the informativeness of each variable, the
  variables selected for the model and performance metrics.
- the model’s *dictionary* file (``Modeling.kdic``): An enriched version
  of the initial dictionary file that contains the model. It can be used
  to make predictions on new data.

.. code:: ipython3

    iris_report, iris_model_kdic = kh.train_predictor(
        iris_kdic,
        dictionary_name="Iris",
        data_table_path=iris_data_file,
        target_variable="Class",
        results_dir=iris_results_dir,
        max_trees=0,  # by default Khiops constructs 10 decision tree variables
    )
    print(f"Iris report file: {iris_report}")
    print(f"Iris modeling dictionary: {iris_model_kdic}")

.. parsed-literal::
Iris report file: exercises/Iris/AllReports.khj
Iris modeling dictionary: exercises/Iris/Modeling.kdic
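
Since the modeling report is a JSON file, it can also be opened with Python's standard ``json`` module. The sketch below is only an illustration: ``top_level_sections`` is a hypothetical helper, it assumes the report's top level is a JSON object, and the stand-in file uses made-up keys so that the code runs without Khiops.

```python
import json
import os
import tempfile


def top_level_sections(report_path):
    """Hypothetical helper: lists the top-level keys of a JSON report file."""
    with open(report_path, encoding="utf8") as report_file:
        report = json.load(report_file)
    return sorted(report)


# Stand-in file so the sketch runs without Khiops. Its keys are made up
# and do not describe the layout of a real report.
stand_in_path = os.path.join(tempfile.mkdtemp(), "StandIn.khj")
with open(stand_in_path, "w", encoding="utf8") as stand_in_file:
    json.dump({"sectionB": {}, "sectionA": {}}, stand_in_file)

sections = top_level_sections(stand_in_path)
print(sections)
```

With the variables of this lesson, ``top_level_sections(iris_report)`` would list the sections of the actual report.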
You can verify that the result files were created in
``iris_results_dir``. In the next sections, we’ll use the file at
``iris_report`` to assess the model’s performance and the file at
``iris_model_kdic`` to deploy it. Now we can view the report with the
Khiops Visualization app:
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(iris_report)
Exercise
~~~~~~~~
We’ll repeat the examples of this notebook with the ``Adult`` dataset.
It contains characteristics of the adult population in the USA, such as
age, gender and education. The task is to predict the variable
``class``, which indicates whether the individual earns ``more`` or
``less`` than 50,000 dollars.
Let’s start by putting into variables the paths for the ``Adult``
dataset:
.. code:: ipython3
adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")
Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
print(f"Adult dictionary file: {adult_kdic}")
peek(adult_kdic)
print(f"Adult data file: {adult_data_file}\n")
peek(adult_data_file)
.. parsed-literal::
Adult dictionary file: /github/home/khiops_data/samples/Adult/Adult.kdic
Dictionary Adult
{
Categorical Label ;
Numerical age ;
Categorical workclass ;
Numerical fnlwgt ;
Categorical education ;
Numerical education_num ;
Categorical marital_status ;
Adult data file: /github/home/khiops_data/samples/Adult/Adult.txt
Label age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country class
1 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States less
2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States less
3 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States less
4 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States less
5 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba less
6 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States less
7 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica less
8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States more
9 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States more
We now save the results directory for this exercise:
.. code:: ipython3
adult_results_dir = os.path.join("exercises", "Adult")
print(f"Adult results directory: {adult_results_dir}")
.. parsed-literal::
Adult results directory: exercises/Adult
Train a classifier for the ``Adult`` database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that the name of the target variable is ``class`` (**in lower case!**).
Do not forget to set ``max_trees=0``. Save the resulting file locations
into the variables ``adult_report`` and ``adult_model_kdic`` and print
them.
.. code:: ipython3

    adult_report, adult_model_kdic = kh.train_predictor(
        adult_kdic,
        dictionary_name="Adult",
        data_table_path=adult_data_file,
        target_variable="class",
        results_dir=adult_results_dir,
        max_trees=0,
    )
    print(f"Adult report file: {adult_report}")
    print(f"Adult modeling dictionary file: {adult_model_kdic}")

.. parsed-literal::
Adult report file: exercises/Adult/AllReports.khj
Adult modeling dictionary file: exercises/Adult/Modeling.kdic
Inspect the results with the Khiops Visualization app
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(adult_report)
Accessing a Classifier’s Basic Evaluation Metrics
-------------------------------------------------
We access the classifier’s evaluation metrics by loading the file at
``iris_report`` with the Khiops function
``read_analysis_results_file``:
.. code:: ipython3
iris_results = kh.read_analysis_results_file(iris_report)
print(type(iris_results))
.. parsed-literal::

    <class 'khiops.core.analysis_results.AnalysisResults'>

The resulting object is an instance of the ``AnalysisResults`` class.
The model evaluation reports are stored in its
``train_evaluation_report`` and ``test_evaluation_report`` attributes
which are of class ``EvaluationReport``.
.. code:: ipython3
iris_train_eval = iris_results.train_evaluation_report
iris_test_eval = iris_results.test_evaluation_report
print(type(iris_train_eval))
print(type(iris_test_eval))
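.. parsed-literal::

    <class 'khiops.core.analysis_results.EvaluationReport'>
    <class 'khiops.core.analysis_results.EvaluationReport'>
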
We access the default predictor’s metrics with the
``get_snb_performance`` method of the evaluation report objects:
.. code:: ipython3
iris_train_performance = iris_train_eval.get_snb_performance()
iris_test_performance = iris_test_eval.get_snb_performance()
These objects are of class ``PredictorPerformance`` and have
``accuracy`` and ``auc`` attributes for these metrics:
.. code:: ipython3
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris test accuracy: {iris_test_performance.accuracy}")
print("")
print(f"Iris train AUC: {iris_train_performance.auc}")
print(f"Iris test AUC: {iris_test_performance.auc}")
.. parsed-literal::
Iris train accuracy: 0.980952
Iris test accuracy: 0.955556
Iris train AUC: 0.997868
Iris test AUC: 0.984362
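
As a reminder of what the accuracy measures, here is a minimal sketch in plain Python: the fraction of records whose predicted class equals the true class. The ``accuracy`` function and the toy labels are hypothetical and are not part of Khiops. The last two lines check that the values reported above are consistent with 103 of 105 train records and 43 of 45 test records being correctly classified; that split is an inference from the numbers, not something the report states here.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of records whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels, not taken from the Iris results
toy_true = ["Iris-setosa", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]
toy_pred = ["Iris-setosa", "Iris-virginica", "Iris-versicolor", "Iris-versicolor"]
toy_accuracy = accuracy(toy_true, toy_pred)
print(f"Toy accuracy: {toy_accuracy}")

# Assumption: 105 train and 45 test records (inferred, see the text above)
train_check = round(103 / 105, 6)
test_check = round(43 / 45, 6)
print(train_check, test_check)
```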
Exercise
~~~~~~~~
Read the contents of the file at ``adult_report`` for the Adult analysis and print its type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
adult_results = kh.read_analysis_results_file(adult_report)
type(adult_results)
.. parsed-literal::
khiops.core.analysis_results.AnalysisResults
Save the evaluation reports of the ``Adult`` classification to the variables ``adult_train_eval`` and ``adult_test_eval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
adult_train_eval = adult_results.train_evaluation_report
adult_test_eval = adult_results.test_evaluation_report
Show the model’s train and test accuracies and AUCs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
adult_train_performance = adult_train_eval.get_snb_performance()
adult_test_performance = adult_test_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult test accuracy: {adult_test_performance.accuracy}")
print("")
print(f"Adult train AUC: {adult_train_performance.auc}")
print(f"Adult test AUC: {adult_test_performance.auc}")
.. parsed-literal::
Adult train accuracy: 0.869295
Adult test accuracy: 0.865714
Adult train AUC: 0.926145
Adult test AUC: 0.921665
Deploying a Classifier
----------------------
We are going to deploy the ``Iris`` classifier we have just trained,
applying it to the same dataset (normally we would deploy it on new
data). We saved the model in the file at ``iris_model_kdic``. This file
is usually large and hard to read, so you should know what you are doing
before editing it. Just this time let’s take a quick look at its
contents:
.. code:: ipython3
peek(iris_model_kdic, 25)
.. parsed-literal::
#Khiops 10.3.0
Dictionary SNB_Iris
{
Unused Numerical SepalLength ;
Unused Numerical SepalWidth ;
Unused Numerical PetalLength ;
Unused Numerical PetalWidth ;
Unused Categorical Class ;
Unused Structure(DataGrid) VClass = DataGrid(ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 32, 35)) ;
Unused Structure(DataGrid) PPetalLength = DataGrid(IntervalBounds(3.15, 4.75, 5.15), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26)) ; // DataGrid(PetalLength, Class)
Unused Structure(DataGrid) PPetalWidth = DataGrid(IntervalBounds(0.75, 1.75), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 31, 1, 0, 2, 33)) ; // DataGrid(PetalWidth, Class)
Unused Structure(Classifier) SNBClass = SNBClassifier(Vector(0.3515625, 0.4375), DataGridStats(PPetalLength, PetalLength), DataGridStats(PPetalWidth, PetalWidth), VClass) ;
Categorical PredictedClass = TargetValue(SNBClass) ;
Unused Numerical ScoreClass = TargetProb(SNBClass) ;
Numerical `ProbClassIris-setosa` = TargetProbAt(SNBClass, "Iris-setosa") ;
Numerical `ProbClassIris-versicolor` = TargetProbAt(SNBClass, "Iris-versicolor") ;
Numerical `ProbClassIris-virginica` = TargetProbAt(SNBClass, "Iris-virginica") ;
};
Note that the modeling dictionary contains 4 used variables (those not
marked ``Unused``):

- ``PredictedClass``: The class with the highest probability according
  to the model
- ``ProbClassIris-setosa``, ``ProbClassIris-versicolor``,
  ``ProbClassIris-virginica``: The probabilities of each class according
  to the model

The original target ``Class`` is marked ``Unused`` in the modeling
dictionary, so it is not part of the output.

These will be the columns of the output table when deploying the model:
.. code:: ipython3

    iris_deployment_file = os.path.join(iris_results_dir, "iris_deployment.txt")
    kh.deploy_model(
        iris_model_kdic,
        dictionary_name="SNB_Iris",
        data_table_path=iris_data_file,
        output_data_table_path=iris_deployment_file,
    )
    peek(iris_deployment_file)

.. parsed-literal::
PredictedClass ProbClassIris-setosa ProbClassIris-versicolor ProbClassIris-virginica
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
Iris-setosa 0.9884494887 0.008598869265 0.002951642068
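
The relation between ``PredictedClass`` and the probability columns can be checked with a few lines of standard Python. The sketch below parses one row copied from the output above and verifies that the predicted class is the one with the highest probability. The tab separator is an assumption (the listing only shows whitespace), and the variable names are chosen for illustration.

```python
import csv
import io

# One row copied from the deployment output above.
# Assumption: the output table is tab-separated.
sample_output = (
    "PredictedClass\tProbClassIris-setosa\t"
    "ProbClassIris-versicolor\tProbClassIris-virginica\n"
    "Iris-setosa\t0.9884494887\t0.008598869265\t0.002951642068\n"
)

PROB_PREFIX = "ProbClass"
checks = []
for row in csv.DictReader(io.StringIO(sample_output), delimiter="\t"):
    # Map each class name to its probability, e.g. "Iris-setosa" -> 0.988...
    probabilities = {
        column[len(PROB_PREFIX):]: float(value)
        for column, value in row.items()
        if column.startswith(PROB_PREFIX)
    }
    most_probable = max(probabilities, key=probabilities.get)
    checks.append(
        (row["PredictedClass"], most_probable, sum(probabilities.values()))
    )

for predicted, most_probable, total in checks:
    print(predicted, most_probable, round(total, 6))
```

To run the same check on the whole deployment, one could pass an open handle on ``iris_deployment_file`` to ``csv.DictReader`` instead of the sample string.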
Exercise
~~~~~~~~
Use the ``deploy_model`` function to deploy the model stored in the file at ``adult_model_kdic``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Which columns are deployed?
.. code:: ipython3

    adult_deployment_file = os.path.join(adult_results_dir, "adult_deployment.txt")
    kh.deploy_model(
        adult_model_kdic,
        dictionary_name="SNB_Adult",
        data_table_path=adult_data_file,
        output_data_table_path=adult_deployment_file,
    )
    peek(adult_deployment_file)

.. parsed-literal::
Predictedclass Probclassless Probclassmore
less 0.9999926658 7.33418716e-06
more 0.4122763795 0.5877236205
less 0.9624691952 0.03753080482
less 0.9158716208 0.08412837917
less 0.5717571015 0.4282428985
more 0.2594836411 0.7405163589
less 0.9939376151 0.006062384897
more 0.4223655109 0.5776344891
more 0.001798128 0.998201872